Predicting Income Levels Using the Adult Census Dataset
from ucimlrepo import fetch_ucirepo
import pandas as pd
from sklearn.model_selection import train_test_split
import numpy as np
import matplotlib.pyplot as plt
import altair as alt

alt.data_transformers.enable("vegafusion")
Summary
In this project, we examine whether demographic and employment-related factors can be used to predict if an individual’s annual income exceeds $50K, using the Adult Census Income dataset. After cleaning the data and preparing the features, we compared several classification models, including logistic regression, SVM, and random forest. Logistic regression provided the strongest baseline performance, and after further tuning, the final model reached an accuracy of 0.80 and a weighted F1 score of 0.83 on the test set. While the model performs well on the majority class, it continues to face challenges in identifying higher-income individuals due to class imbalance. Overall, our findings suggest that income prediction is feasible, though additional techniques—such as resampling or more advanced models—may help improve performance on the minority class.
Introduction
Income inequality in Canada has reached a record high in recent years (Statistics Canada, 2025). The widening of Canada’s wealth gap has been shown to perpetuate poverty cycles, reduce economic mobility, and limit socioeconomic opportunities (Connolly and Haeck, 2024). Understanding the individual factors associated with income level could help governments and policy makers better address these issues.
In this project, we seek to build a machine learning model capable of accurately predicting whether an individual’s income exceeds $50,000. To perform this classification, our model considers factors such as age, education level, occupation, marital status, and gender. We believe an income classification model is valuable because it can identify broad patterns and highlight the features that contribute most to high-income status. These insights could also shed light on structural inequalities, economic trends, and educational impacts. Although a human could perform this task reasonably accurately, a machine learning model enables scalability and limits the biases present in human decision making. Therefore, our goal is to build a model that uses relevant features to predict income status, and to evaluate how well it performs.
Methods
Data
The dataset used in this project is the Adult Census Income dataset, originally extracted from the 1994 U.S. Census database by Barry Becker. In 1996 the dataset was donated to the UCI Machine Learning Repository in collaboration with Ronny Kohavi (Becker and Kohavi 1996); it can be found here: https://archive.ics.uci.edu/dataset/2/adult. The dataset contains 48,842 rows and 14 columns. We split the data into a 40% training set and a 60% test set. Each row represents an individual’s personal and demographic information, including the income class label (whether their annual income exceeded $50K or not) and attributes such as age, education level, occupation, and hours per week.
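The 40%/60% split described above can be sketched as follows. This is a minimal illustration with a tiny stand-in DataFrame (the column names and random seed are assumptions, not taken from the project code); stratifying on the label keeps the income proportions similar in both splits.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the combined features plus income label
df = pd.DataFrame({
    "age": [25, 38, 44, 52, 29, 61, 33, 47],
    "income": ["<=50K", "<=50K", ">50K", ">50K",
               "<=50K", ">50K", "<=50K", ">50K"],
})

# 40% of rows go to training and 60% to test, as described above
train_df, test_df = train_test_split(
    df, train_size=0.4, random_state=123, stratify=df["income"]
)
print(len(train_df), len(test_df))
```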
To examine whether each predictor may help discriminate between the two income groups (<=50K vs. >50K), we first explored the numeric and categorical features separately.
adult = fetch_ucirepo(id=2)
X = adult.data.features
y = adult.data.targets
By checking the summary table below, we observe that there are several missing values in features such as workclass, occupation, and native-country. These missing values will be handled during data preprocessing using an appropriate imputation strategy.
Additionally, the target variable income contains inconsistencies in spelling (e.g., ‘<=50K’ vs. ‘<=50K.’, with a trailing period), so we cleaned these values to ensure a consistent label representation.
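A minimal sketch of this label cleaning, assuming the labels arrive as a pandas Series (the sample values below are hypothetical, mimicking the two spellings found in the data):

```python
import pandas as pd

# Hypothetical raw labels mixing the two spellings seen in the data
y_raw = pd.Series(["<=50K", ">50K.", "<=50K.", ">50K"], name="income")

# Strip surrounding whitespace and the trailing period so that
# each income class has exactly one spelling
y_clean = y_raw.str.strip().str.rstrip(".")
print(y_clean.unique())
```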
summary = pd.DataFrame({
    "unique value count": train_df.nunique(),
    "null count": train_df.isnull().sum(),
}).T
summary
According to the summary table, the numeric features vary widely in both range and scale. For example, hours-per-week ranges from 1 to 99, whereas capital-gain ranges from 0 to 99,999. This large difference in magnitude makes it necessary to apply a StandardScaler to normalize the features so that models relying on distance can compare them fairly.
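The effect of the scaling step can be sketched as below. This is an illustration with hypothetical values spanning the ranges mentioned above, not the project's actual preprocessing pipeline; StandardScaler rescales each column to zero mean and unit variance so that features on very different scales become comparable.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical numeric slice echoing the ranges described above
num = pd.DataFrame({
    "hours-per-week": [1, 40, 60, 99],
    "capital-gain": [0, 0, 5000, 99999],
})

# After scaling, each column has mean ~0 and unit variance
scaled = StandardScaler().fit_transform(num)
print(scaled.mean(axis=0).round(6))
```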
Additionally, some variables such as capital-gain and capital-loss are extremely skewed, with most values concentrated near zero. We next compare how the distributions of the numeric features differ between the two target classes, income <=50K and >50K.
Age: The two income groups differ clearly in their age distributions. Individuals earning >50K tend to be older, most commonly between ages 40 and 55, while the <=50K group is more spread out and skewed younger. This suggests that higher income may be linked to greater work experience.
Education-num: Individuals earning >50K generally have higher education levels, indicating that formal education is an important factor associated with income.
Capital-gain: The distribution is strongly right-skewed, with most values at zero for both groups. Non-zero values appear more often in the >50K group, and there is an extreme outlier near 100,000.
Capital-loss: The pattern is similar to capital-gain, with almost all values at zero and slightly more non-zero observations in the >50K group.
Hours-per-week: Individuals earning >50K tend to work longer hours, showing a clear shift toward extended working time.
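The per-class patterns summarized above can also be checked numerically by grouping on the label. A small sketch with hypothetical rows (not the actual census data) that echo the described tendencies:

```python
import pandas as pd

# Toy rows echoing the patterns described above: the >50K group
# skews older and works longer hours
df = pd.DataFrame({
    "income": ["<=50K", "<=50K", ">50K", ">50K", "<=50K", ">50K"],
    "age": [22, 31, 45, 52, 27, 48],
    "hours-per-week": [35, 40, 50, 55, 38, 60],
})

# Median age and weekly hours per income class
medians = df.groupby("income")[["age", "hours-per-week"]].median()
print(medians)
```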
In the Spearman Correlation Bubble Chart for numeric features, we can see that none of the numeric variables are strongly correlated with each other. The highest correlation is between education level and hours per week, at approximately 0.14, which is still very weak. Therefore, multicollinearity is not a concern for these features.
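The matrix behind such a bubble chart can be computed directly with pandas. A minimal sketch, using a hypothetical numeric frame in place of the training features:

```python
import pandas as pd

# Hypothetical numeric slice standing in for the training features
num = pd.DataFrame({
    "age": [39, 50, 38, 53, 28],
    "education-num": [13, 13, 9, 7, 13],
    "hours-per-week": [40, 13, 40, 45, 40],
})

# Rank-based (Spearman) correlation matrix among numeric features;
# values near 0 indicate weak monotonic association
corr = num.corr(method="spearman")
print(corr.round(2))
```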
Categorical Features
For categorical features, we produced normalized stacked bar charts to visualize how proportions of each income group differ across categories.
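The proportions underlying a normalized stacked bar chart can be obtained with a row-normalized crosstab. A sketch with hypothetical workclass values (the real chart was drawn with Altair; this only shows the computation of the plotted proportions):

```python
import pandas as pd

# Hypothetical categorical feature and income label
df = pd.DataFrame({
    "workclass": ["Private", "Private", "Self-emp",
                  "Private", "Self-emp", "Gov"],
    "income": ["<=50K", ">50K", ">50K", "<=50K", "<=50K", ">50K"],
})

# Row-normalized proportions: each workclass row sums to 1, giving
# the share of each income group within that category
props = pd.crosstab(df["workclass"], df["income"], normalize="index")
print(props)
```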